
Conversation

@aheizi
Contributor

@aheizi aheizi commented May 13, 2025

Related GitHub Issue

#3555

Description

Currently, roo-code reads files as UTF-8 by default. When a file is encoded in GBK or another encoding, this produces garbled text (mojibake).

Test Procedure

Manual testing:

  1. Set VSCode's default encoding to GBK
  2. Let roo-code read and then edit this file

Type of Change

  • 🐛 Bug Fix: Non-breaking change that fixes an issue.
  • ✨ New Feature: Non-breaking change that adds functionality.
  • 💥 Breaking Change: Fix or feature that would cause existing functionality to not work as expected.
  • ♻️ Refactor: Code change that neither fixes a bug nor adds a feature.
  • 💅 Style: Changes that do not affect the meaning of the code (white-space, formatting, etc.).
  • 📚 Documentation: Updates to documentation files.
  • ⚙️ Build/CI: Changes to the build process or CI configuration.
  • 🧹 Chore: Other changes that don't modify src or test files.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Code Quality:
    • My code adheres to the project's style guidelines.
    • There are no new linting errors or warnings (npm run lint).
    • All debug code (e.g., console.log) has been removed.
  • Testing:
    • New and/or updated tests have been added to cover my changes.
    • All tests pass locally (npm test).
    • The application builds successfully with my changes.
  • Branch Hygiene: My branch is up-to-date (rebased) with the main branch.
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Changeset: A changeset has been created using npm run changeset if this PR includes user-facing changes or dependency updates.
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

before:
(screenshots)

after:
(screenshots)

Documentation Updates

Does this PR necessitate updates to user-facing documentation?

  • No documentation updates are required.
  • Yes, documentation updates are required. (Please describe what needs to be updated or link to a PR in the docs repository).

Additional Notes


Important

Introduces readFileWithEncoding to handle multiple file encodings, replacing fs.readFile in key tools to prevent garbled text issues, and adds necessary dependencies.

  • Behavior:
    • Introduces readFileWithEncoding in readFileWithEncoding.ts to handle multiple file encodings, including UTF-8, UTF-16, and GBK.
    • Replaces fs.readFile with readFileWithEncoding in applyDiffTool.ts, insertContentTool.ts, searchAndReplaceTool.ts, and DiffViewProvider.ts to prevent garbled text issues.
  • Dependencies:
    • Adds iconv-lite and jschardet to package.json for encoding detection and conversion.
  • Misc:
    • Updates extract-text.ts to use readFileWithEncoding for non-binary files.
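For reference, a minimal sketch of what such a helper might look like, assuming only the jschardet and iconv-lite APIs named above (everything else here is illustrative, not the PR's actual code):

```typescript
import * as fs from "fs/promises"
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Illustrative sketch, not the PR's implementation.
export async function readFileWithEncoding(filePath: string): Promise<string> {
	const buffer = await fs.readFile(filePath)

	// jschardet guesses the encoding from the raw bytes (e.g. "UTF-8", "GB2312").
	const detected = jschardet.detect(buffer)
	const encoding = detected?.encoding?.toLowerCase()

	// Decode with iconv-lite when the guess is usable; otherwise fall back to UTF-8.
	if (encoding && iconv.encodingExists(encoding)) {
		return iconv.decode(buffer, encoding)
	}
	return buffer.toString("utf-8")
}
```

The actual change layers candidate encodings and scoring on top of this basic detect-then-decode shape, as discussed in the review below.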

This description was created by Ellipsis for 3f36e526a3a5a0f4668b4c53ff205afa3db26a33. You can customize this summary. It will automatically update as commits are pushed.

@changeset-bot

changeset-bot bot commented May 13, 2025

⚠️ No Changeset found

Latest commit: da3e39a025379e5fec86b15f957bb92579a8edf6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@hannesrudolph hannesrudolph moved this from New to PR [Draft/WIP] in Roo Code Roadmap May 14, 2025
@aheizi aheizi marked this pull request as ready for review May 14, 2025 03:11
@aheizi aheizi requested review from cte and mrubens as code owners May 14, 2025 03:11
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels May 14, 2025
@aheizi aheizi marked this pull request as draft May 14, 2025 03:43
@aheizi aheizi marked this pull request as ready for review May 16, 2025 15:51
@hannesrudolph hannesrudolph moved this from New to PR [Pre Approval Review] in Roo Code Roadmap May 20, 2025
@hannesrudolph hannesrudolph moved this from PR [Needs Review] to TEMP in Roo Code Roadmap May 26, 2025
@daniel-lxs daniel-lxs moved this from TEMP to PR [Needs Review] in Roo Code Roadmap May 27, 2025
Member

@daniel-lxs daniel-lxs left a comment


Hey @aheizi, Thank you for your contribution. I apologize we took so long to review your PR.

Looking at the whole flow, it seems like we're doing encoding detection twice - once with chardet and then again with your custom logic. Could we simplify this to just use chardet's result?

Thank you again for your contribution and patience, I'm looking forward to getting this PR ready for review.

Member


I noticed this alwaysTextExtensions array is also defined in extract-text.ts but with a different format (dots vs no dots). Should we maybe centralize this list somewhere to avoid duplication?

Contributor Author


OK, it has been extracted as a public constant.
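For illustration, the centralized list might look something like this (file name, constant name, and contents are hypothetical):

```typescript
// shared/text-extensions.ts (hypothetical module)
// Single source of truth, stored without dots; callers normalize before lookup.
export const ALWAYS_TEXT_EXTENSIONS = new Set(["txt", "md", "json", "ts", "js", "css", "html"])

export function isAlwaysTextExtension(ext: string): boolean {
	// Accept both ".md" and "md" so call sites can't disagree on format again.
	return ALWAYS_TEXT_EXTENSIONS.has(ext.replace(/^\./, "").toLowerCase())
}
```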

Member


Should we maybe log these errors for debugging? Silent failures could make it hard to troubleshoot encoding issues later.

Contributor Author


This catch block handles failures while attempting to decode the file with different encodings. That is an expected situation rather than a serious error: the code is designed to try multiple encodings until the best match is found, so it is normal for some of them to fail.
If logging is wanted, a debug log could be added, but I haven't found anywhere in the project where debug logs are currently used.
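A sketch of the try-and-continue pattern being described (names are illustrative; note that iconv-lite usually substitutes U+FFFD replacement characters rather than throwing, so the decoded output needs checking as well):

```typescript
import * as iconv from "iconv-lite"

// Try candidates in order; a failed attempt is expected, not a serious error.
function decodeFirstClean(buffer: Buffer, candidates: string[]): string | null {
	for (const encoding of candidates) {
		try {
			const text = iconv.decode(buffer, encoding)
			// iconv-lite inserts U+FFFD for undecodable bytes, so treat
			// replacement characters as a failed attempt too.
			if (!text.includes("\uFFFD")) return text
		} catch {
			// Unknown/unsupported encoding name: skip and try the next candidate.
		}
	}
	return null // caller picks the final fallback (e.g. plain UTF-8)
}
```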

Member


For large files, wouldn't decoding the entire buffer multiple times be slow? Have you considered just trusting chardet's detection and only falling back if that fails?

Contributor Author


Great point — decoding large buffers multiple times is definitely something to watch out for in terms of performance.

You’re absolutely right that for large files, it makes sense to avoid trying multiple encodings upfront. One improvement I’m planning is to set a file size threshold (e.g. 1MB):

  • For small files, we keep the current logic — try several likely encodings and score the result.
  • For large files, we'll first trust chardet and decode using its result. Only if the decoded content looks suspicious (e.g. low score, unreadable characters) will we fall back to trying a few alternatives like gb18030.

This way we preserve accuracy for tricky cases (e.g. GBK-encoded .js or .txt files that start with mostly ASCII), while avoiding unnecessary work for large files where chardet is usually good enough.

Happy to push this change if it sounds good to you.
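Sketched out, that plan might look roughly like this (the 1MB threshold and gb18030 fallback come from the comment above; scoreText is the PR's scoring idea, stubbed here, and the rest is illustrative):

```typescript
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Stub: the scoring heuristic is sketched later in this thread.
declare function scoreText(text: string): number

const LARGE_FILE_THRESHOLD = 1024 * 1024 // 1MB, per the proposal above

function decodeSmart(buffer: Buffer): string {
	const detected = jschardet.detect(buffer)?.encoding?.toLowerCase()

	// Large file: trust chardet and decode once; only re-check if suspicious.
	if (buffer.length > LARGE_FILE_THRESHOLD && detected && iconv.encodingExists(detected)) {
		const text = iconv.decode(buffer, detected)
		if (scoreText(text) >= 0.05) return text
	}

	// Small file (or suspicious large-file result): score a few likely encodings.
	const candidates = [detected, "utf-8", "gb18030", "shift_jis"].filter(
		(e): e is string => !!e && iconv.encodingExists(e),
	)
	let best = { text: buffer.toString("utf-8"), score: -Infinity }
	for (const encoding of candidates) {
		const text = iconv.decode(buffer, encoding)
		const score = scoreText(text)
		if (score > best.score) best = { text, score }
	}
	return best.text
}
```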

Member


How did you arrive at the 0.05 threshold? Have you tested this with different types of files to see if this value works well across various scenarios?

I'm curious about this custom scoring system, would just trusting chardet's detection be enough in this case?

Contributor Author


The scoring system was introduced as a safeguard against misdetections from chardet, especially in East Asian contexts. In practice, chardet often misclassifies GBK/GB18030 files as UTF-8 if the text is mostly ASCII (e.g. source code with only occasional Chinese comments). A simple confidence score from chardet doesn’t always reflect actual readability.

The scoreText function favors encodings that produce a reasonable amount of Chinese or full-width characters, and penalizes pure ASCII results. The 0.05 threshold came from empirical testing across a mix of file types:

  • UTF-8 Chinese content typically scores around 0.2–0.6
  • GBK files decoded incorrectly as UTF-8 usually get negative or near-zero scores
  • Pure ASCII text tends to score around -1
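One scorer consistent with those ranges would count CJK and full-width characters and penalize replacement characters, along these lines (a sketch matching the behavior described, not the PR's exact formula):

```typescript
// Favor CJK/full-width characters; penalize U+FFFD and pure-ASCII results.
function scoreText(text: string): number {
	if (text.length === 0) return -1

	let cjk = 0
	let replacement = 0
	for (const ch of text) {
		const code = ch.codePointAt(0)!
		if (code === 0xfffd) replacement++ // U+FFFD marks undecodable bytes
		// CJK Unified Ideographs, plus half-width and full-width forms.
		else if ((code >= 0x4e00 && code <= 0x9fff) || (code >= 0xff00 && code <= 0xffef)) cjk++
	}

	if (replacement > 0) return -replacement / text.length // garbled decode
	if (cjk === 0) return -1 // pure ASCII carries no signal either way
	return cjk / text.length // mixed code/Chinese text lands well above 0.05
}
```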

@aheizi
Contributor Author

aheizi commented May 31, 2025

> Hey @aheizi, Thank you for your contribution. I apologize we took so long to review your PR.
>
> Looking at the whole flow, it seems like we're doing encoding detection twice - once with chardet and then again with our custom logic. Could we simplify this to just use chardet's result?
>
> Thank you again for your contribution and patience, I'm looking forward to getting this PR ready for review.

Hi, @daniel-lxs Thank you for taking the time to review this PR!

You’re right — the flow involves detecting encoding with chardet, and then trying multiple candidate encodings including the one from chardet. The reason for this is that chardet’s detection can often be unreliable, especially for short or ambiguous files (e.g. GBK-encoded .js or .txt files that contain mostly ASCII). In such cases, decoding only with chardet’s top guess can lead to misinterpretation or mojibake.

The secondary scoring pass (via scoreText) helps us choose the most plausible decoding result among a few likely encodings, particularly prioritizing those common in Chinese or East Asian contexts (utf-8, gb18030, shift_jis, etc.).

@aheizi aheizi force-pushed the fix-file-encoding branch from da3e39a to 00b1261 Compare June 1, 2025 03:22
@daniel-lxs daniel-lxs moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jun 2, 2025
@aheizi aheizi force-pushed the fix-file-encoding branch from 6156fea to 00ffe14 Compare June 3, 2025 02:29
@daniel-lxs
Member

Hi @aheizi, thanks for your work on fixing the file encoding issues. This is an important area to get right.

The current approach in readFileSmart uses several custom rules and thresholds (like the scoreText function, the 0.05 scoring limit, the list of text file extensions, and the special logic for large files) to guess the file encoding.

While this might work for the cases you've tested, relying on many custom rules like this can make the solution complex and potentially unreliable as we encounter different files or new situations in the future. It can also make the code harder to understand and maintain.

We need to find a simpler and more robust way to handle file encodings. For example, we should explore:

  • Relying more directly on chardet's detection capabilities and its reported confidence. If chardet is uncertain, we could have a very straightforward fallback (e.g., to UTF-8, or prompt the user if that's feasible).
  • Investigating if we can leverage VS Code's own encoding detection mechanisms, as it's generally quite good.

The aim is to have a solution that is dependable and easier to maintain, rather than a complex system of custom checks. A clear method for common encodings with a well-defined, simple fallback is generally preferred.

Could you explore these simpler approaches?
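For concreteness, the first suggestion in that list would reduce to something like this (a sketch; the 0.8 confidence cutoff is an arbitrary example, not a tested value):

```typescript
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Trust chardet only when it is confident; otherwise fall back to UTF-8.
function decodeSimple(buffer: Buffer): string {
	const detected = jschardet.detect(buffer)
	if (
		detected?.encoding &&
		detected.confidence >= 0.8 && // example cutoff, would need tuning
		iconv.encodingExists(detected.encoding)
	) {
		return iconv.decode(buffer, detected.encoding)
	}
	return buffer.toString("utf-8")
}
```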

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Jun 4, 2025
@daniel-lxs daniel-lxs marked this pull request as draft June 4, 2025 16:22
@aheizi
Contributor Author

aheizi commented Jun 4, 2025

> Hi @aheizi, thanks for your work on fixing the file encoding issues. This is an important area to get right.
>
> The current approach in readFileSmart uses several custom rules and thresholds (like the scoreText function, the 0.05 scoring limit, the list of text file extensions, and the special logic for large files) to guess the file encoding.
>
> While this might work for the cases you've tested, relying on many custom rules like this can make the solution complex and potentially unreliable as we encounter different files or new situations in the future. It can also make the code harder to understand and maintain.
>
> We need to find a simpler and more robust way to handle file encodings. For example, we should explore:
>
>   • Relying more directly on chardet's detection capabilities and its reported confidence. If chardet is uncertain, we could have a very straightforward fallback (e.g., to UTF-8, or prompt the user if that's feasible).
>   • Investigating if we can leverage VS Code's own encoding detection mechanisms, as it's generally quite good.
>
> The aim is to have a solution that is dependable and easier to maintain, rather than a complex system of custom checks. A clear method for common encodings with a well-defined, simple fallback is generally preferred.
>
> Could you explore these simpler approaches?

Hi @daniel-lxs , thanks a lot for your thoughtful feedback — I really appreciate it.

Regarding your suggestions:
1. VS Code’s encoding detection: I also initially considered using VS Code’s built-in encoding detection. However, based on my research, the extension API (e.g., vscode.workspace.openTextDocument) doesn’t actually auto-detect encodings. It defaults to UTF-8, so unfortunately it doesn’t help much in our case.
2. Using chardet directly: I tested this as well, but found that chardet performs poorly in distinguishing between UTF-8 and GBK — the two main encodings we need to handle. Its confidence scores in these cases are often too low or misleading, which makes it unreliable on its own.

Given those limitations, I opted for the current approach, which is admittedly more complex, but has worked reliably in the cases I tested. It includes some heuristics and fallback logic that aim to cover common scenarios while still falling back to UTF-8 if detection fails.

That said, I completely agree with the goal of simplifying this logic. If we can find a more robust and maintainable way to handle encoding detection — especially one that avoids custom heuristics — I’m absolutely open to revisiting the current implementation. For now, I think this version is a step forward in terms of correctness and can serve as a foundation we can refine.

Thanks again for the suggestions — I’d love to keep the discussion going if you have any further ideas.

@aheizi
Contributor Author

aheizi commented Jun 9, 2025

@daniel-lxs In the latest commit, I based the implementation on VS Code's own encoding handling: https://github.com/microsoft/vscode/blob/main/src/vs/workbench/services/textfile/common/encoding.ts
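That VS Code module begins by sniffing the byte-order mark before running jschardet heuristics; the BOM step itself is simple (a sketch of that first check, using the standard BOM byte sequences; the function name is illustrative):

```typescript
// Standard byte-order marks checked up front in VS Code's encoding.ts.
function detectEncodingByBOM(buffer: Buffer): string | null {
	if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
		return "utf-8"
	}
	if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
		return "utf-16le"
	}
	if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
		return "utf-16be"
	}
	return null // no BOM; fall through to heuristic detection
}
```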

@aheizi aheizi marked this pull request as ready for review June 9, 2025 10:53
@aheizi aheizi requested a review from jr as a code owner June 9, 2025 10:53
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 9, 2025
Contributor


Consider using the built-in buffer.toString("latin1") instead of manually looping over the buffer in encodeLatin1 for improved performance and clarity.

Contributor Author


This mirrors a comment in the VS Code source explaining why the conversion is done manually:

// before guessing jschardet calls toString('binary') on input if it is a Buffer,
// since we are using it inside browser environment as well we do conversion ourselves
// https://github.com/aadsm/jschardet/blob/v2.1.1/src/index.js#L36-L40
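In Node the two are equivalent, since latin1 maps each byte 0-255 to the code unit with the same value; VS Code keeps the manual loop only because Buffer isn't available in its browser build. A small sketch of both:

```typescript
// Browser-safe byte-to-string conversion, mirroring Buffer's "latin1"/"binary"
// behavior: each byte becomes the UTF-16 code unit with the same value.
function encodeLatin1(bytes: Uint8Array): string {
	let result = ""
	for (let i = 0; i < bytes.length; i++) {
		result += String.fromCharCode(bytes[i])
	}
	return result
}

// Node-only equivalent, simpler and faster:
// const str = Buffer.from(bytes).toString("latin1")
```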

@aheizi aheizi force-pushed the fix-file-encoding branch from 3f36e52 to 3e079d6 Compare June 9, 2025 12:28
@daniel-lxs daniel-lxs moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jun 10, 2025
@daniel-lxs
Member

Hey @aheizi, thank you for taking the time to tackle this issue. Unfortunately, I don't think the implementation aligns with our goals: it adds quite a complex rating system to detect encoding, and that complexity makes it hard to test and maintain.

This doesn't mean your implementation is bad or that the issue is not important; we just want a simpler solution to this issue, if one exists.

I'll close this PR, but feel free to continue the discussion. I'll also gladly answer any questions you might have.

Thank you again!

@daniel-lxs daniel-lxs closed this Jun 12, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Jun 12, 2025
@github-project-automation github-project-automation bot moved this from PR [Draft/WIP] to Done in Roo Code Roadmap Jun 12, 2025
SmartManoj pushed a commit to SmartManoj/Raa-Code that referenced this pull request Jun 13, 2025